ABSTRACT

Next-generation workstations will have hardware support for digital "continuous media" (CM) 
such as audio and video. CM applications handle data at high rates, with strict timing require- 
ments, and often in small "chunks". If such applications are to run efficiently and predictably 
as user-level programs, an operating system must provide scheduling and IPC mechanisms 
that reflect these needs. We propose two such mechanisms: split-level CPU scheduling of 
lightweight processes in multiple address spaces, and memory-mapped streams for data 
movement between address spaces. These techniques reduce the number of user/kernel 
interactions (system calls, signals, and preemptions). Compared with existing mechanisms, 
they can reduce scheduling and I/O overhead by a factor of 4 to 6. 



1. INTRODUCTION 

Support for digital audio and video as I/O media 
is an important direction of computer systems 
research. We call audio and video continuous media 
(CM) because they are perceived as continuous, in 
contrast with discrete media such as graphics. There 
are various ways to incorporate CM in computer
systems; in the integrated approach, CM data (digital
audio and compressed digital video) is handled by 
user-level programs on general purpose operating 
systems such as Unix or Mach. 



On existing general purpose OSs, integrated CM 
applications can suffer from poor performance; ACME 
[4] is one such application. ACME is a user-level I/O 
server that provides shared, network-transparent 
access to devices such as video cameras, speakers, 
and microphones (see Figure 1). We have imple- 
mented a prototype of ACME for a Sun SPARCstation 
running SunOS 4.1. It suffers from timing errors and
lost data when there is concurrent system activity, 
even though the hardware is easily able to handle the 
data rates (e.g., 64 Kb/sec audio data). The sender 
also cannot supply the low delay needed for a tele- 
phone conversation client. 

These problems are partly due to the overhead 

of user/kernel interaction mechanisms by which user- 
level programs invoke system functions such as CPU 
scheduling and I/O. This overhead includes 
user/kernel domain switches and mapping switches 



[Figure 1 diagram. Labels: workstation, file server, user, kernel, ACME client, ACME server, DAC, speaker, display.]

Figure 1 : Audio playback is a basic integrated CM ap- 
plication. The client reads CM data from a file and 
sends it to the ACME server (bold line). The client 
also provides a graphical interface for making selec- 
tions and controlling the playback parameters. 



between different user virtual address spaces. For 
example, the UNIX asynchronous I/O mechanism 
requires up to ten domain switches and two mapping 
switches to read a block of data. The expense of 
these operations can be amortized by hysteresis and 
increased granularity (techniques used in buffered I/O 
and pipes). For CM applications, however, these 
techniques may increase delay excessively. 

With the goal of better supporting integrated CM 
applications, we have designed OS mechanisms for 
scheduling and IPC. 

• Split-level scheduling and synchronization. In this
approach each user virtual address space (VAS)
contains multiple lightweight processes (LWPs).
The scheduler is partitioned into user-level and
kernel-level parts, which communicate via shared
memory. The information in shared memory is
used to correctly prioritize LWPs in different VASs
and avoid domain and mapping switches where
possible. Split-level scheduling can be used with
many scheduling policies; we discuss its use for
deadline/workahead scheduling, a real-time policy
designed for CM.

• Memory-mapped streams. A memory-mapped
stream (MMS) is a shared-memory FIFO used for
communicating CM data between user and kernel
VASs. Once the MMS has been set up, no explicit
kernel requests are needed to transfer data, and a
minimal number of domain switches are needed
for producer/consumer synchronization and I/O
initiation.



In the next section we explain the process
structure of the ACME server and the deadline/workahead
scheduling policy in more detail. Sections 3 and 4 
describe the new mechanisms. Section 5 gives some 
performance estimates, and Section 6 discusses 
related work. 

2. PROCESS STRUCTURE AND SCHEDULING 
FOR CM APPLICATIONS 

To motivate subsequent sections, we sketch a
typical CM application (the ACME I/O server), and
describe the deadline/workahead CPU scheduling
policy.

2.1. The ACME Continuous Media I/O Server 

ACME (Abstractions for Continuous Media) [4] 
supports applications such as audio/video conferenc- 
ing, editing, and browsing. ACME allows its clients to 
create logical devices, associate them with physical 
I/O devices (video display or camera, audio speaker 
or microphone), and do I/O of CM data over CM con- 
nections (network connections carrying CM data). 
The data stream on a given CM connection may be 
multiplexed among different logical devices. ACME 
provides mechanisms for synchronizing different 
streams. 

The ACME server performs multiple concurrent 
activities, and it is convenient to structure it as a set of 
concurrent processes. Our prototype uses the 
Figure 2: A CM application such as the ACME server 
consists of multiple processes sharing a single ad- 
dress space. Some of these processes handle 
streams of CM data, while others handle discrete 
events. 



following processes (see Figure 2): 

• For each CM connection, a network I/O process 
transfers data between an internal buffer and the 
network; it may do software processing (e.g.,
volume scaling for audio streams). 

• For each CM I/O device there is a device I/O pro- 
cess. For an output device, this process merges 
the data from the logical devices mapped to it and 
writes the resulting data to the device. 

• Event-handling processes handle non-real-time 

events such as commands from the window server 
and requests for CM connection establishment. 

The current implementation of ACME runs on 
the Sun SPARCstation. It is written in C++ and uses a 
preemptive lightweight process library. I/O is done 
using UNIX asynchronous I/O. The server handles 
telephone-quality (64 Kbps) audio I/O and video out- 
put, both compressed and uncompressed. 

2.2. Deadline/Workahead Scheduling

The Deadline/Workahead Scheduling (DWS) 
CPU scheduling policy is designed for integrated CM 
[2]. In the DWS model, a process that handles CM 
data is called a real-time process. There are two 
classes of non-real-time processes: interactive (for 
which fast response time is important) and back- 
ground. 

A real-time process handles a sequence of
messages, each with a logical arrival time l(m), either
derived from a timestamp in the data or implicit from
its position in the stream. Each real-time process has
a fixed logical delay bound; the processing of each
message should be finished within this amount after
its logical arrival. At a given time t, a real-time
process is called critical if it has an unprocessed
message m with l(m) < t (i.e., m's logical arrival time
has passed). Real-time processes that have pending
messages but are not critical are called workahead
processes.

The DWS policy is as follows (see Figure 3). 
Critical processes have priority over all others, and are 
preemptively scheduled earliest deadline first (the 
deadline of a process is the logical arrival time of its 
first unprocessed message plus its delay bound). 
Interactive processes have priority over workahead 
processes, but are preempted when those processes 
become critical. Non-real-time processes are 
scheduled according to an unspecified policy, such as 
the UNIX time-slicing policy. The scheduling policy for 
workahead processes is also unspecified, and may be 
chosen to minimize context switching. 
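
As an illustration only, the DWS priority rules described above can be sketched in C as follows. The type and function names (dws_proc, dws_classify, dws_pick) are hypothetical and do not come from the ACME or DWS implementations. The sketch assumes that non-real-time processes are always runnable, and it picks the first candidate found within the classes whose policy the text leaves unspecified.

```c
#include <stddef.h>

/* Hypothetical DWS classes, ordered from highest to lowest priority. */
typedef enum { DWS_CRITICAL, DWS_INTERACTIVE, DWS_WORKAHEAD,
               DWS_BACKGROUND, DWS_IDLE } dws_class;

typedef struct {
    int    realtime;     /* nonzero: the process handles CM data */
    int    interactive;  /* non-real-time only: interactive vs. background */
    int    pending;      /* real-time only: has an unprocessed message */
    double arrival;      /* l(m) of the first unprocessed message */
    double delay_bound;  /* fixed logical delay bound */
} dws_proc;

/* Classify a process at time `now`, using the text's test l(m) < t. */
dws_class dws_classify(const dws_proc *p, double now)
{
    if (!p->realtime)
        return p->interactive ? DWS_INTERACTIVE : DWS_BACKGROUND;
    if (!p->pending)
        return DWS_IDLE;                    /* no message to process */
    return (p->arrival < now) ? DWS_CRITICAL : DWS_WORKAHEAD;
}

/* Deadline: logical arrival time of the first message plus delay bound. */
double dws_deadline(const dws_proc *p)
{
    return p->arrival + p->delay_bound;
}

/* Return the index of the process to run at `now`, or -1 if none. */
int dws_pick(const dws_proc *procs, size_t n, double now)
{
    int best = -1;
    dws_class best_class = DWS_IDLE;
    size_t i;

    for (i = 0; i < n; i++) {
        dws_class c = dws_classify(&procs[i], now);
        if (c == DWS_IDLE)
            continue;
        /* A higher class wins; among critical processes, the
           earliest deadline wins. */
        if (best < 0 || c < best_class ||
            (c == DWS_CRITICAL && best_class == DWS_CRITICAL &&
             dws_deadline(&procs[i]) < dws_deadline(&procs[best]))) {
            best = (int)i;
            best_class = c;
        }
    }
    return best;
}
```

A caller would invoke dws_pick() whenever a message arrives, a process finishes a message, or a workahead process becomes critical.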




[Figure 3 diagram. Panel a) critical and workahead processes: message queues of P1 (critical), P2 (critical), and P3 (workahead) plotted against time. Panel b) prioritization of process classes, listed along a priority axis: critical, interactive, workahead, background.]



Figure 3: In the deadline/workahead scheduling 
(DWS) policy, each real-time process has a queue of 
pending messages. In example a), each message is 
shown as a rectangle whose left edge is its logical
arrival time and whose right edge is its deadline. P1 and
P2 are critical because they have a pending message
whose logical arrival time is in the past. Processes 
are prioritized as shown in b). Critical processes are 
executed earliest deadline first; policies for other 
classes are unspecified. 



3. SPLIT-LEVEL SCHEDULING AND SYNCHRONIZATION

CM applications are most easily programmed 
using multiple processes sharing a virtual address 
space (VAS). The two common multiprogramming 
techniques, lightweight processes (LWPs) and 
threads, each have advantages. LWPs are implemented
purely at the user level, so context switches
within a VAS are fast (on the order of tens of
instructions). However, LWPs in different VASs may not be
prioritized correctly. On the other hand, threads in dif- 
ferent VASs can be correctly prioritized but context 
switches always involve an expensive user/kernel 
interaction. 

Split-level scheduling is a scheduler implemen- 
tation technique that combines the advantages of 
threads and LWPs: it minimizes user/kernel interac- 
tions while correctly prioritizing LWPs in different 
VASs. In the uniprocessor version of split-level
scheduling¹, multiple LWPs per VAS share a single
thread. An LWP sleeps or changes its priority by
calling a user-level scheduler (ULS) (see Figure 4). The
ULS checks whether its VAS still contains the globally
highest-priority LWP; this is done by examining an
area of memory shared with the kernel. If so, the
LWP context switch is done without kernel
intervention. Otherwise, a kernel trap is done, and the
kernel-level scheduler (KLS) decides which VAS should
now execute, again based on information in shared
memory.

¹ The technique is applicable to multiprocessor scheduling
as well. For brevity we describe only the uniprocessor case.

While split-level scheduling can be used with 
many scheduling policies, we focus on its implementa- 
tion for the deadline/workahead (DWS) policy 
described in the previous section. We also describe a 
related mechanism for efficient mutual exclusion 
between LWPs. For simplicity, we consider only the 
scheduling of real-time processes. It is straightfor- 
ward to handle interactive and background processes 
as well (a VAS could contain a mixture of process 
types).

3.1. Client Interface to the Split-Level DWS 
Scheduler 

A user-level library provides the client interface 
to the split-level DWS scheduler. The library exports 
interfaces for creating and destroying LWPs. An LWP 
P has three scheduling parameters: a fixed delay
bound (see Section 2.2), a critical time C_P (the logical
arrival time of its next message) and a deadline D_P
(C_P plus the delay bound). The library provides the
following functions for scheduling LWPs: 

time_advance(TIME critical_time);

An LWP P calls this when it finishes a message; the
argument is the logical arrival time of the next
message. time_advance() updates C_P, and may yield
the CPU.

timed_sleep(TIME critical_time);

An LWP calls this to suspend its execution until the
given time; at this point it becomes runnable and C_P is
set to the current time. This may be used by
processes that do time-based output with no device
synchronization (e.g., slow video) or for rate-based
flow control.

IO_wait(DESCRIPTOR iodesc,
        TIME critical_time);

An LWP calls this to wait for I/O to become possible
on the given I/O descriptor representing a file, socket,
I/O device or MMS (Section 4). When data arrives on
the descriptor, the process becomes runnable and its
C_P is set to the given value.

mask_LWP_preemption();
unmask_LWP_preemption();

These calls bracket "critical sections" within which the 
calling LWP cannot be preempted by an LWP in the 
same VAS. 




Figure 4: Using split-level scheduling, the kernel-level 
scheduler decides which user VAS should execute, 
and each VAS has a user-level scheduler (ULS) that 
manages the LWPs in that VAS. In this example, the 
KLS chooses VAS $2 to run because it has the global- 
ly earliest deadline. The ULS in that VAS executes 
Ps, which has this deadline. User/kernel interactions 
can often be avoided: in this example, if Ps yields then 
the context switch to Pq (the next earliest deadline) 
can be done without a kernel call. 



3.2. Implementation of the Split-Level DWS
Scheduler 

In this section we first describe the control and 
shared memory interfaces between the ULS and the 
KLS (see Figure 5). We then describe the implemen- 
tation of each level. We defer discussing synchroniza- 
tion issues (e.g., mutual exclusion on shared data 
structures) until Section 3.3.

3.2.1. User/Kernel Control Interface 

The control interface between a ULS and the
KLS consists of system calls and user-interrupts. The
system call mechanism is the same as in UNIX-type
systems: a trap instruction and return. The split-level
scheduler needs one new system call: yield() yields
the processor to another VAS.

User-interrupts are like UNIX signals except that
the handler does not end with a system call to reset
the signal mask (hence there is one domain switch
rather than three). Each ULS registers the addresses
of its handlers during initialization. Three types of
user-interrupts are used: int_timer is delivered
when a timer elapses, int_io_ready is delivered
when I/O becomes possible on an I/O descriptor, and
int_resume is delivered when a user VAS resumes
after being preempted.
Figure 5: The user-level and kernel-level parts of the
split-level scheduler communicate using system calls,
user-interrupts, and through an area of shared
memory.



3.2.2. User/Kernel Shared Memory Interface 

The ULS for each VAS A shares a region of
physical memory with the kernel. This region consists of
two parts: the usched area and the ksched area (see
Figure 5). The usched area is written by the ULS and
read by the KLS. It contains the following:

• D_A: the minimum of D_P for critical processes
P ∈ A, or +∞ if there are none. (A runnable LWP P
is critical if C_P < T_now and workahead if C_P > T_now.)
In other words, D_A is the earliest deadline of a
critical LWP in A.

• A runnable flag, TRUE if there are any runnable
LWPs (critical or workahead) in the VAS.

• A table of workahead and sleeping LWPs P in A
such that D_P < D_A. Each entry in the table contains
the critical time and deadline of the LWP.

• For each I/O descriptor, a waiting_for_IO flag
indicating whether an LWP is blocked on the
descriptor, and if so the critical time and deadline of the
LWP.

• T_next: the time at which the next int_timer
user-interrupt should be delivered.

The ksched area, written by the KLS and read by the
ULS, contains the following:

• T_now: the current real time as measured by a
hardware clock.

• D_Ā: the earliest deadline of a critical LWP not in A.

• For each I/O descriptor, a ready_for_IO flag to
indicate that data has arrived on that descriptor.


We use the following additional notation:

• P_A: the highest priority runnable LWP in A. If D_A is
finite then P_A is the earliest-deadline runnable
critical LWP. Otherwise, P_A is set to an arbitrary
runnable LWP (the choice of P_A in this case depends
on the policy for workahead processes, which we
do not specify).

• P*: the globally highest priority LWP.

• A*: the VAS containing P*.

3.2.3. ULS Implementation 

The ULS of VAS A is responsible for scheduling
LWPs in A. If the ULS detects from its ksched area
that A ≠ A*, it calls yield(). Similarly, if the KLS
detects from A's usched area that A ≠ A*, it preempts
A.

The ULS may need to preempt the currently run- 
ning LWP when the critical time of a sleeping LWP is 
reached or a non-running workahead LWP becomes 
critical. This requires an int_timer user-interrupt 
from the kernel. To reduce the number of int_timer
user-interrupt deliveries, the following policy is used 
(see Figure 6): 
Figure 6: At a given time T_now, the ULS for a VAS A
must have a pending int_timer user-interrupt for the
earliest critical time of a sleeping or workahead
process P ∈ A such that D_P < D_A. In this example, P3 is
critical and P1 and P2 are workahead. If P3 is still
running when C_2 arrives, P2 becomes critical and must
preempt P3. On the other hand, P1 cannot preempt P3
because its deadline is greater. Therefore a timer is
needed for C_2 but not C_1.



Let X be the set of sleeping and workahead
LWPs P in A such that D_P < D_A. Let
T_critical = min(C_P : P ∈ X). Then it is sufficient
for the ULS to maintain a timer for T_critical.
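
The timer rule above can be made concrete with a small C sketch. The names (lwp, earliest_critical_deadline, timer_target) are hypothetical, times are plain doubles, and INFINITY stands for "no deadline" or "no timer needed"; none of this is taken from the actual ULS.

```c
#include <stddef.h>
#include <math.h>    /* INFINITY */

typedef enum { LWP_CRITICAL, LWP_WORKAHEAD, LWP_SLEEPING } lwp_state;

typedef struct {
    lwp_state state;
    double    C;     /* critical time */
    double    D;     /* deadline */
} lwp;

/* D_A: earliest deadline of a critical LWP, or +infinity if none. */
double earliest_critical_deadline(const lwp *v, size_t n)
{
    double d_a = INFINITY;
    size_t i;
    for (i = 0; i < n; i++)
        if (v[i].state == LWP_CRITICAL && v[i].D < d_a)
            d_a = v[i].D;
    return d_a;
}

/* T_critical: earliest critical time among sleeping and workahead
   LWPs whose deadline is earlier than D_A. INFINITY: no timer. */
double timer_target(const lwp *v, size_t n)
{
    double d_a = earliest_critical_deadline(v, n);
    double t = INFINITY;
    size_t i;
    for (i = 0; i < n; i++)
        if (v[i].state != LWP_CRITICAL && v[i].D < d_a && v[i].C < t)
            t = v[i].C;
    return t;
}
```

With values in the spirit of Figure 6 (P3 critical with deadline 30, P1 workahead with C = 12 and D = 40, P2 workahead with C = 15 and D = 25), only P2 can preempt P3, so the timer is set for 15 and P1's earlier critical time is ignored.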

In addition to the data in the usched area, the
ULS maintains queues of sleeping, critical and
workahead LWPs. The implementations of
timed_sleep(), time_advance(), and IO_wait()
are as follows. Each function inserts the calling LWP
into the appropriate structure (the sleep queue, the
workahead or critical queue, and an I/O descriptor
respectively), then does the following (see Figure 7):

(1) For each LWP P in the workahead and sleep
queues such that C_P < T_now, insert P in the
critical queue. For each LWP P sleeping on an
I/O descriptor for which the ready_for_IO flag
is set, insert P into the workahead or critical
queues as appropriate.

(2) Update D_A in the usched area.

(3) Update the usched area's table of sleeping
and workahead LWPs P with D_P < D_A and the
list of LWPs waiting for I/O.

(4) If A ≠ A* then call yield(), else

(5) Set T_next = T_critical. Do a context switch to P_A.

The handler for an int_timer user-interrupt
moves the LWPs for which C_P < T_now from the sleep
queue to the critical queue. It then executes steps 2,
3 and 5 above. The handler for an int_io_ready
user-interrupt moves all LWPs for which the
ready_for_IO flag is set to the critical or workahead
queue and executes steps 1-5 above.

An int_resume user-interrupt is delivered to a
VAS when it resumes execution after having been
preempted. Between when the VAS was preempted
and T_now, an indeterminate amount of time has
elapsed. The same is true when the VAS returns from
a yield() system call. In both cases, the ULS
performs steps 1-5 above to update its state.
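
The following C sketch illustrates steps 1-5 under simplifying assumptions; it is not the authors' implementation. All names are hypothetical. The LWPs are held in one array tagged with a queue identifier, I/O descriptors are omitted, and the test for A ≠ A* is assumed to be a comparison between the D_A value in the usched area and the other VASs' earliest deadline in the ksched area (called D_others here). When no LWP is critical anywhere, the sketch simply runs the first workahead LWP.

```c
#include <stddef.h>
#include <math.h>    /* INFINITY */

typedef enum { Q_SLEEP, Q_WORKAHEAD, Q_CRITICAL } lwp_queue;
typedef struct { lwp_queue q; double C, D; } uls_lwp;

typedef struct { double D_A; double T_next; int runnable; } usched_area;
typedef struct { double T_now; double D_others; } ksched_area;

#define ULS_YIELD (-1)   /* caller should invoke the yield() system call */

/* Returns the index of the LWP to switch to, or ULS_YIELD. */
int uls_reschedule(uls_lwp *v, size_t n,
                   usched_area *us, const ksched_area *ks)
{
    double t_critical = INFINITY;
    int p_a = -1, first_workahead = -1;
    size_t i;

    /* Step 1: LWPs whose critical time has passed become critical. */
    for (i = 0; i < n; i++)
        if (v[i].q != Q_CRITICAL && v[i].C < ks->T_now)
            v[i].q = Q_CRITICAL;

    /* Step 2: recompute D_A and find the earliest-deadline critical LWP. */
    us->D_A = INFINITY;
    for (i = 0; i < n; i++) {
        if (v[i].q == Q_CRITICAL && v[i].D < us->D_A) {
            us->D_A = v[i].D;
            p_a = (int)i;
        } else if (v[i].q == Q_WORKAHEAD && first_workahead < 0) {
            first_workahead = (int)i;
        }
    }
    if (p_a < 0)
        p_a = first_workahead;   /* workahead policy is unspecified */
    us->runnable = (p_a >= 0);

    /* Step 3 (abridged): the array itself stands in for the usched
       table; only T_critical is derived from it here. */
    for (i = 0; i < n; i++)
        if (v[i].q != Q_CRITICAL && v[i].D < us->D_A && v[i].C < t_critical)
            t_critical = v[i].C;

    /* Step 4: give up the processor if another VAS should run. */
    if (!us->runnable || ks->D_others < us->D_A)
        return ULS_YIELD;

    /* Step 5: arm the timer and switch to P_A. */
    us->T_next = t_critical;
    return p_a;
}
```

In this sketch the common case returns an LWP index without any kernel involvement; ULS_YIELD is returned only when the shared-memory values indicate that another VAS holds the earliest deadline or that nothing is runnable.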

3.2.4. KLS Implementation 

The KLS is responsible for updating D_Ā in the
ksched area of the currently executing VAS A. If in
doing so it detects that A ≠ A*, it preempts A and
switches to A*.

Changes to D_Ā can occur when a sleeping LWP
wakes up or a workahead LWP becomes critical.
Timers have to be set for these moments. KLS timer
management is analogous to that of a ULS. The KLS
maintains a timer for the earliest C_P such that D_P < D_Ā;
this is computed from the tables in the usched areas
of all VASs not currently executing. If, when the timer
expires, the current VAS A is no longer A*, the KLS
Figure 7: In this example, a VAS contains LWPs
P1 ... P7. The current process, P4, has called
timed_sleep(), and the ULS has inserted it in the
sleeping queue. The ULS then does the following
(see Section 3.2.3): it moves P2 to the critical queue,
records P3 and P4 in the usched area, sets D_A to De,
and sets a timer for C_4. Finally, it does a context
switch to Ps.



preempts A and switches to the new A*. Additionally,
the kernel clock interrupt handler polls T_next in the
usched area of A, delivering an int_timer if
necessary.

The yield() system call determines A*. It
then computes D_Ā for A*, writes it to A*'s ksched area and
updates the pending timer if necessary. Finally, it
switches to A*, either by returning from an earlier
yield() system call, or by delivering an int_resume
user-interrupt.

The handler for an I/O completion interrupt
examines the waiting_for_IO flag for the corresponding
descriptor. If it is set, the interrupt handler sets the
ready_for_IO flag in the ksched area of the VAS A
containing the descriptor. Moreover, if the LWP
waiting on that descriptor is critical and has the earliest
deadline, an int_io_ready user-interrupt is
delivered, preempting the current VAS if necessary. If
A ≠ A*, the handler updates D_Ā in A's ksched area,
depending on whether the waiting LWP is critical.

3.3. Split-Level Synchronization 

ULS/KLS shared memory can be concurrently
accessed by multiple entities (LWPs, user-interrupt 
handlers and kernel interrupt handlers). We require 



mechanisms to synchronize access to this shared 
memory. By analyzing how specific shared data 
structures are accessed by different entities, we can 
obtain a set of specialized synchronization mechan- 
isms that minimize user/kernel interactions. 

First, ULS data structures such as the critical, 
workahead and sleeping queues are read and written 
by LWPs and user-interrupt handlers. To synchronize 
access to such structures it suffices to inhibit (or 
"mask") user-interrupts (since preemptive context 
switches within the VAS take place only in user- 
interrupt handlers, this inhibits LWP preemption as 
well). User-interrupt masking can also be used to
implement mask_LWP_preemption() and
unmask_LWP_preemption(), which provide mutual
exclusion for client-defined data structures.

A technique called virtual user-interrupt masking
provides user-interrupt masking without user/kernel
interactions in the normal case. This technique uses a
mask level in the usched area and a request flag in
the ksched area. The request flag is a bitmap with
one flag per user-interrupt type. To mask
user-interrupts, the ULS increments the mask level.
Whenever the kernel wants to deliver an interrupt and finds
the mask level nonzero, it sets the corresponding bit in
the request flag. When the ULS unmasks
user-interrupts it decrements the mask level. If this returns
to zero and the request flag is set, the ULS calls the
appropriate handler to service the interrupt.
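
A minimal single-threaded C sketch of virtual user-interrupt masking is given below to make the protocol concrete. The structure and function names, the INT_* constants, and the counting handler are all hypothetical; a real implementation would also have to deal with a kernel interrupt arriving between the decrement and the check of the request flag, which this sketch does not model.

```c
/* Hypothetical user-interrupt types; one request bit per type. */
enum { INT_TIMER, INT_IO_READY, INT_RESUME, INT_TYPES };

typedef void (*uint_handler)(int type);

typedef struct { volatile unsigned mask_level; } usched_mask;    /* written by ULS */
typedef struct { volatile unsigned request; }    ksched_request; /* written by kernel */

/* Example handler: counts how often each type was serviced. */
static int serviced[INT_TYPES];
static void count_handler(int type) { serviced[type]++; }

/* Kernel side: deliver now if unmasked, otherwise record a request.
   Returns 1 if the handler ran, 0 if delivery was deferred. */
int kernel_deliver(const usched_mask *us, ksched_request *ks,
                   uint_handler h, int type)
{
    if (us->mask_level != 0) {
        ks->request |= 1u << type;
        return 0;
    }
    h(type);
    return 1;
}

/* ULS side: masking costs one increment and no system call. */
void uls_mask(usched_mask *us)
{
    us->mask_level++;
}

/* ULS side: when the level returns to zero, service deferred requests. */
void uls_unmask(usched_mask *us, ksched_request *ks, uint_handler h)
{
    int type;

    if (--us->mask_level != 0)
        return;                          /* still masked (nested) */
    for (type = 0; type < INT_TYPES; type++) {
        if (ks->request & (1u << type)) {
            ks->request &= ~(1u << type);
            h(type);
        }
    }
}
```

Because the mask is a counter rather than a boolean, critical sections can nest: deferred interrupts are serviced only when the outermost section is left.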

Second, the tables of sleeping and workahead 
LWPs in the usched area are written by the ULS and 
read by the KLS. These tables are read by the KLS 
only while the VAS is preempted or has yielded. If a 
VAS is preempted while the ULS is writing the tables, 
the KLS sees inconsistent data. To prevent this, we 
need a VAS preemption masking mechanism. "Vir- 
tual" masking can also be used to implement this 
mechanism, using a preemption mask flag in the 
usched area and a preemption request flag in the 
ksched area. While the mask is nonzero, the VAS 
cannot be preempted by another VAS. Upon unmasking
preemption, if the ULS finds the request flag set, it
calls yield().

Third, several items in the ksched area (D_Ā,
T_now, ready_for_IO) are written by kernel interrupt
handlers (clock, I/O) and read by the ULS. It is
possible to do virtual masking of kernel interrupts, but this
has the drawback of requiring a system call to service
interrupts that occur while kernel interrupts are
masked. By exploiting specific properties of these
items, simpler solutions are possible. For example, if
reading or writing a single word is atomic, then a data
structure consisting of a single word (e.g., the
ready_for_IO flag in an I/O descriptor) requires no
synchronization mechanism. For multi-word quantities
such as D_Ā and T_now that are monotonically
increasing or decreasing, a consistent value can be
obtained by repeatedly reading the quantity until two
successive reads result in the same value.
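
The repeated-read technique for multi-word quantities can be sketched in C as follows. The two-word layout and the names (shared_u64, read_stable, write_words) are assumptions made for illustration. On modern hardware a real implementation would also need memory barriers or atomic operations, which are left out here so that only the retry loop is shown.

```c
#include <stdint.h>

/* A 64-bit quantity stored as two separately written 32-bit words,
   standing in for a multi-word value such as T_now. */
typedef struct {
    volatile uint32_t hi;
    volatile uint32_t lo;
} shared_u64;

/* One (possibly torn) read of both words. */
static uint64_t read_words(const shared_u64 *s)
{
    uint64_t hi = s->hi;
    uint64_t lo = s->lo;
    return (hi << 32) | lo;
}

/* Re-read until two successive reads agree. */
uint64_t read_stable(const shared_u64 *s)
{
    uint64_t prev = read_words(s);
    for (;;) {
        uint64_t cur = read_words(s);
        if (cur == prev)
            return cur;
        prev = cur;
    }
}

/* Writer side (the kernel interrupt handler in the text). */
void write_words(shared_u64 *s, uint64_t v)
{
    s->hi = (uint32_t)(v >> 32);
    s->lo = (uint32_t)v;
}
```

The reader never blocks and never enters the kernel; it only pays for extra reads in the rare case that an update lands between two reads.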

Finally, several items in the usched area (e.g.,
T_next, runnable and waiting_for_IO) are read by
kernel interrupt handlers and written by the ULS.
Again, we can exploit specific properties of these
items to achieve simple synchronization mechanisms.
Single-word flags require no synchronization if word
access is atomic. For multi-word quantities (D_A and
T_next) the ULS masks preemption during access. If a
kernel interrupt handler finds that preemption is
masked, it assumes that a multi-word quantity is
inconsistent and takes appropriate action. For
instance, if T_next is inconsistent, the clock interrupt
handler delays checking for int_timer delivery until
the next clock tick; if D_A is inconsistent, the
preemption request flag is set.

3.4. Discussion 

Split-level scheduling introduces new protection 
problems: a malicious or incorrect program may keep 
VAS preemption masked indefinitely, or it may exe- 
cute indefinitely without changing its deadline. Either 
of these actions would starve all other VASs. A
"watchdog timer" can be used to detect such condi- 
tions, and to kill or demote the offending process. 

Deadline/workahead scheduling has both "hard"
and "soft" variants: the distinction is whether or not
processes reserve CPU capacity in advance. In the
hard variant, each new LWP specifies its workload
(message rate and CPU time per message). The KLS
conducts a schedulability test to determine whether
the workload can be accommodated and if so, with 
what logical delay bound. This test involves a simula- 
tion under worst-case load, and is described in [2]. In 
the soft variant, no such screening is done, and it is 
possible for the system to fall behind schedule. 

Split-level scheduling (SLS) is not restricted to
deadline/workahead scheduling; it can be adapted to
other policies, such as static priorities or usage-based
timesharing policies. The policy dictates the contents
of ULS/KLS shared memory; in general, the usched area
contains the highest priority among runnable LWPs in the
address space, while the ksched area contains the 
highest priority among runnable LWPs in other 
address spaces. 

4. MEMORY-MAPPED STREAMS 

Each real-time LWP in a CM application handles 
a stream of CM data. The source and sink of each 
stream are typically I/O devices, and CM data must be 
moved to or from the kernel address space. A 
mechanism for this user/kernel IPC has three com- 
ponents: 



• Control and synchronization: This includes I/O 
initiation and producer/consumer synchronization. 

• Data location transfer: If the addresses of data 
buffers in the user VAS change, they must be 
transferred from the user to the kernel (if the user 
determines the buffer addresses) or vice versa. 

• Data transfer: The actual transfer of data, perhaps 
by copying or VM remapping. 

Traditional user/kernel IPC mechanisms require
a user/kernel interaction for one or more of the above
components in every I/O operation. For example, the
UNIX read() system call performs all three
components. UNIX asynchronous I/O uses the read()
system call for data and data location transfer and the
SIGIO signal and select() system call for control
and synchronization.

Memory-mapped streams (MMS) are a new
class of IPC mechanisms for stream-oriented
user/kernel IPC². An MMS uses shared memory for
control and synchronization. MMSs may use any of a 
number of techniques for data location transfer; all 
these use shared memory to hold either the data itself 
or the data location (with each technique one or more 
data transfer mechanisms are possible, see Section 
4.3). This combination of shared memory mechan- 
isms reduces or eliminates user/kernel interactions in 
I/O operations. 

4.1. Client Interface to Memory-Mapped Streams 

The client interface to MMS consists of the
following library routines:

d = MMS_create(fd, buffer_size, ...); 
MMS_read(d, nbytes) ; 
MMS_write(d, nbytes); 

MMS_create() creates a new MMS, returning a
descriptor. fd identifies the data source or sink
(network connection, disk file, etc.); the data direction
(read or write) is implicit. buffer_size is the size of the
MMS buffer³. Additional arguments may be needed for
the data transfer structure. MMS_read() blocks until
nbytes of data are available, and MMS_write()
blocks until nbytes of data can be written to the
buffer.



² The basic technique of MMS (shared-memory synchronization
structures) can also be used for user/user IPC. We describe
only the user/kernel case here.

³ Streams in which a storage device sources or sinks data
typically have large end-to-end delay bounds (e.g., a second or
more), so buffering may be used to increase the system efficiency
and responsiveness. Streams that are part of an inter-human
conversation or conference have low end-to-end delay bounds
(tens of milliseconds) and must use smaller buffers. The buffer size
may change dynamically: for example, the ACME audio output
process must use a small buffer if any of the streams it is currently
handling is part of a conversation; otherwise it can use a large
buffer.



4.2. Synchronization and I/O Initiation 

MMS_create() allocates and initializes a syn- 
chronization structure in an area of memory shared 
between user and kernel. For concreteness, we dis- 
cuss the synchronization structure and mechanism for 
the case when a user LWP (scheduled by a split-level 
scheduler) reads CM data from an MMS. The syn- 
chronization structure contains the following data: 

• The buffer size.

• N_read: the number of bytes read so far; this is
updated by the LWP.

• N_write: the number of bytes written so far; this is
updated by the kernel. The buffer is empty when
N_read = N_write, and full when they differ by the
buffer size.

• Active: a flag, maintained by the kernel. If false,
further I/O must be initiated by a request from the
user process⁴.

• B_wakeup: if the data level in the MMS is greater than
this number, the interrupt handler sets the
ready_for_IO flag in the MMS descriptor and
delivers an int_io_ready if necessary.

• B_start: this is set by the kernel. A system call to
initiate I/O must be made if the device is not active
and the data level falls below this value.
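
As a rough illustration, the synchronization structure for the user-reads case might be declared in C as shown below, together with the predicates the text derives from it. The field and function names are hypothetical. The byte counters are assumed to be free-running unsigned values, so that their difference remains correct after wraparound.

```c
#include <stddef.h>

/* Hypothetical layout of the shared synchronization structure. */
typedef struct {
    size_t        buffer_size;
    unsigned long n_read;    /* bytes read so far; updated by the LWP */
    unsigned long n_write;   /* bytes written so far; updated by the kernel */
    int           active;    /* maintained by the kernel */
    size_t        b_wakeup;  /* wake the reader above this data level */
    size_t        b_start;   /* restart I/O below this data level */
} mms_sync;

/* Data level: bytes written but not yet read. */
size_t mms_level(const mms_sync *s)
{
    return (size_t)(s->n_write - s->n_read);
}

int mms_empty(const mms_sync *s)
{
    return s->n_read == s->n_write;
}

int mms_full(const mms_sync *s)
{
    return mms_level(s) == s->buffer_size;
}

/* Kernel side: should the waiting reader be made runnable? */
int mms_should_wake(const mms_sync *s)
{
    return mms_level(s) > s->b_wakeup;
}

/* User side: is a system call needed to (re)start I/O? */
int mms_must_initiate(const mms_sync *s)
{
    return !s->active && mms_level(s) < s->b_start;
}
```

Each field has a single writer (the LWP for n_read, the kernel for the rest in this sketch), which is what allows both sides to evaluate these predicates without a system call.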

Hysteresis for I/O initiation is controlled by the
B_start parameter. Hysteresis for process wakeup is
effected by appropriately setting B_wakeup. The DWS
policy for workahead processes also implicitly controls
wakeup hysteresis.

The algorithm for MMS_read() is as follows:

MMS_read(d, n) {
    mask_user_interrupts();
    B_wakeup = n;
    waiting_for_IO = TRUE;
    w = N_write;            /* sample the kernel's write count */
    if (w - N_read < n)
        IO_wait();          /* block until enough data has arrived */
    if ((w - N_read < B_start) && !active)
        initiate_IO();      /* system call to restart I/O */
    waiting_for_IO = FALSE;
    N_read += n;
    unmask_user_interrupts();
}



⁴ A CM I/O device such as a D/A converter is always active: it
continually does I/O, periodically generating interrupts when a
block of data has been input or output. A file system is generally
passive; I/O must be initiated by a system call; this call may trigger
a chain of operations via I/O completion interrupts, but eventually
another system call is needed to restart I/O. A passive stream,
such as a file, can be made active by using a time-based kernel
activity (e.g., polling) to restart I/O without intervention from the client.
Incoming network connections may be either active or passive
depending on the transport protocol used.



This code executes at user level, so I/O interrupts
cannot be masked (mask_user_interrupts()
merely inhibits the delivery of int_io_ready user
interrupts; see Section 3.3). There is a potential race
condition if an I/O interrupt occurs between getting
N_write and calling IO_wait(). This race condition is
avoided, however, by setting waiting_for_IO; if an
I/O interrupt occurs during the critical period, it will
simply set the int_io_ready request flag and the
descriptor's ready_for_IO flag. The ULS will check
these flags when it unmasks user interrupts in
IO_wait(), and will awaken the LWP that called
MMS_read() if necessary.
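
As a rough illustration, the user-level read path can be sketched in C. The descriptor layout, the lower-case function names, and the stubbed io_wait() and initiate_io() (which only count calls and pretend that the awaited data has arrived) are assumptions made for this sketch; they are not the prototype's actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout of an MMS descriptor shared by kernel and user. */
struct mms_desc {
    size_t n_read;          /* bytes consumed by the user (N_read)        */
    size_t n_write;         /* bytes produced by the kernel (N_write)     */
    size_t b_wakeup;        /* data level at which the reader is awakened */
    size_t b_start;         /* level below which I/O must be restarted    */
    bool   active;          /* device keeps doing I/O without system calls */
    bool   waiting_for_io;  /* reader is inside the critical period       */
    bool   ready_for_io;    /* set by the interrupt handler               */
    int    io_waits;        /* counters kept only for this illustration   */
    int    io_starts;
};

/* Stand-ins: a real ULS would set and clear a shared "masked" flag here. */
static void mask_user_interrupts(void)   { }
static void unmask_user_interrupts(void) { }

/* Stub: a real io_wait() suspends the LWP; here we pretend the kernel
 * has filled the buffer up to the wakeup level by the time it returns. */
static void io_wait(struct mms_desc *d) {
    d->io_waits++;
    d->n_write = d->n_read + d->b_wakeup;
    d->ready_for_io = false;
}

/* Stub: a real initiate_io() is the system call that restarts passive I/O. */
static void initiate_io(struct mms_desc *d) { d->io_starts++; }

/* Consume n bytes from the stream, following the algorithm in the text. */
void mms_read(struct mms_desc *d, size_t n) {
    assert(d != NULL);
    mask_user_interrupts();
    d->b_wakeup = n;
    d->waiting_for_io = true;
    if (d->n_write - d->n_read < n)
        io_wait(d);                      /* not enough data yet */
    if (d->n_write - d->n_read < d->b_start && !d->active)
        initiate_io(d);                  /* passive stream running low */
    d->waiting_for_io = false;
    d->n_read += n;
    unmask_user_interrupts();
}
```

In this sketch a read that finds enough data touches only the shared descriptor; the two stubs mark the only points at which the LWP would block or enter the kernel.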

The kernel interrupt handler for an n-byte read
operation completion does the following:

    append data to data transfer structure;
    update data location transfer structure if needed;
    N_write += n;
    if (waiting_for_IO)
        if (N_write - N_read >= B_wakeup) {
            ready_for_IO = TRUE;
            if (C_p < T_now and D_p < D)
                deliver INT_IO_READY user interrupt to VAS
        }

Synchronization is simpler in this case because the
LWP cannot preempt the interrupt handler.
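
The handler's decision logic might look as follows in C. This is only a sketch under stated assumptions: the struct and field names are hypothetical, the wake-up test is taken to mean "at least B_wakeup bytes are available", C_p and D_p are read as the critical time and deadline of the waiting LWP, and D as the deadline of the LWP currently running in the VAS. The function returns whether a user interrupt should be delivered rather than delivering one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the shared MMS descriptor used by the handler. */
struct mms_desc {
    size_t n_read, n_write, b_wakeup;
    bool   waiting_for_io, ready_for_io;
};

/* Scheduling state visible to the kernel (names are assumptions). */
struct sched_info {
    long c_p;    /* critical time of the waiting LWP (C_p)             */
    long d_p;    /* deadline of the waiting LWP (D_p)                  */
    long d_vas;  /* deadline of the LWP now running in the VAS (D)     */
    long t_now;  /* current time (T_now)                               */
};

/* Called when n more bytes have been appended to the stream buffer.
 * Returns true if an INT_IO_READY user interrupt should be delivered. */
bool read_complete(struct mms_desc *d, const struct sched_info *s, size_t n) {
    assert(d != NULL && s != NULL);
    d->n_write += n;
    if (!d->waiting_for_io)
        return false;                       /* nobody is waiting */
    if (d->n_write - d->n_read < d->b_wakeup)
        return false;                       /* still below wakeup level */
    d->ready_for_io = true;                 /* the ULS will see this later */
    /* Interrupt only if the waiter is critical and would preempt. */
    return s->c_p < s->t_now && s->d_p < s->d_vas;
}
```

When the function returns false with ready_for_IO set, the waiting LWP is picked up at the next user-level scheduling decision, so no user/kernel interaction is needed for that wakeup.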

4.3. Data and Data Location Transfer 

The mechanisms for transferring data location,
and the data itself, are largely independent of control
and synchronization. Some possibilities are:

• Data is passed in pages of physical memory that
are statically shared between kernel and user.
Data location is implicit. Data copying may still be 
necessary: for user writing, the kernel may need a 
copy of the data (e.g. for retransmission) after the 
page has been reused; for user reading, the client 
may need to write the data to another MMS. 

• Data is passed in a fixed range of virtual pages 
that are mapped dynamically to physical pages. 
Data location is implicit, and copying can be 
avoided in some cases. 

• The kernel and user share an array of "message 
descriptors" that contain pointers to blocks of data. 
Data may be transferred by remapping, by copy- 
ing, or by copy-on-write. 

The optimal choice of mechanism depends on
factors such as remapping cost and message size.
The control and synchronization mechanism
described earlier may have to be slightly modified in
some cases; for example, the N_read and N_write variables
may need to be defined in terms of pages or
messages instead of bytes.
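
The third option, with the counters defined in messages rather than bytes, can be sketched in C as a ring of descriptors. The type names, the ring size, and the single-producer/single-consumer assumption are all choices made for this illustration; the sketch also ignores memory-ordering concerns that would arise on a multiprocessor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8   /* hypothetical ring size */

/* One "message descriptor": a pointer to a block of data and its length. */
struct msg_desc {
    void  *data;
    size_t len;
};

/* Array of descriptors shared by kernel (producer) and user (consumer).
 * n_write and n_read count messages, not bytes, and only ever increase. */
struct msg_ring {
    struct msg_desc slot[RING_SLOTS];
    size_t n_write;
    size_t n_read;
};

/* Producer side: publish a block; fails if the ring is full. */
bool ring_put(struct msg_ring *r, void *data, size_t len) {
    assert(r != NULL);
    if (r->n_write - r->n_read == RING_SLOTS)
        return false;
    r->slot[r->n_write % RING_SLOTS] = (struct msg_desc){ data, len };
    r->n_write++;
    return true;
}

/* Consumer side: take the oldest block; fails if the ring is empty. */
bool ring_get(struct msg_ring *r, struct msg_desc *out) {
    assert(r != NULL && out != NULL);
    if (r->n_write == r->n_read)
        return false;
    *out = r->slot[r->n_read % RING_SLOTS];
    r->n_read++;
    return true;
}
```

How the block behind each pointer reaches the other address space (remapping, copying, or copy-on-write) is independent of this ring, which carries only data location.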



5. PERFORMANCE 

In this section we show by example how split- 
level scheduling and memory-mapped streams reduce 
the number of user/kernel interactions. We then com- 
pare the performance of split-level scheduled LWPs 
and MMSs with other alternatives for scheduling and 
I/O. 

5.1. Example Scenario 

To see how split-level scheduling and MMSs
together reduce the number of user/kernel interactions,
consider the following scenario (see Figure 8).
An application (say the ACME server) has two real-time
LWPs and one background LWP: a device I/O
LWP P_D for audio output, a network I/O LWP P_N
reading from a CM connection, and an event-handling
LWP P_E. P_D has an MMS for output to the audio
output device, which interrupts every 30 ms. This MMS's
buffer is small (e.g., because the stream it is handling
is part of a low-delay conversation). P_N has an input
MMS from its network connection; I/O is passive and
the MMS buffer is large (e.g., because the data is
coming from a file). The LWPs are scheduled using a
split-level scheduler. A typical sequence is as follows.

(1) At time 10 ms P_D completes processing a
block of audio data and calls MMS_write(),
which calls IO_wait() since the MMS buffer
is full. P_N is now the highest priority runnable
LWP, so the ULS switches to it.

(2) P_N repeatedly calls MMS_read() to wait for a
message, processes the message and calls
time_advance(). At time 27 ms,
MMS_read() sees that the MMS buffer is
empty, and calls IO_wait(). P_E is now the
highest priority LWP, so the ULS switches to
it.
Figure 8: Split-level scheduling and MMSs reduce
user/kernel interactions. P_D and P_N do several I/O
operations, but there is only a single user/kernel
interaction.



(3) At time 30 ms an audio output interrupt
occurs. P_D, which was earlier suspended
because the MMS buffer was full (see step 1),
now becomes critical, so the interrupt handler
delivers an int_io_ready. This causes the
ULS to preempt P_E and switch to P_D.

(4) Data arrives for P_N at time 35 ms. Because
P_N has worked ahead into the CM stream,
its deadline is later than that of P_D
(D_PN > D_PD), and P_N cannot become the highest
priority LWP in the VAS. Therefore, the interrupt
handler only sets the ready_for_IO flag
and does not deliver an int_io_ready.

(5) At time 40 ms, P_D completes the message and
calls MMS_write(), which calls IO_wait()
(see step 1 above). From the ready_for_IO
flag in P_N's MMS descriptor, the ULS finds that
the LWP is runnable and also the highest
priority LWP. So, the ULS switches to P_N.

In this scenario, the only user/kernel interaction 
is the user interrupt at time 30 ms. No system calls for 
I/O or scheduling are needed. An int_io_ready at 
time 35 ms is also eliminated. 
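
Step 5 of the scenario, in which the ULS discovers a runnable LWP purely from shared memory, can be sketched in C as follows. The struct lwp type, the uls_pick_next() function, and the use of an earliest-deadline rule with LONG_MAX standing in for "background" are assumptions made for this sketch, not a description of the actual ULS.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-LWP record kept by the user-level scheduler. */
struct lwp {
    const char *name;
    long        deadline;       /* LONG_MAX marks a background LWP       */
    bool        blocked_on_io;  /* LWP is suspended inside IO_wait()     */
    const bool *ready_for_io;   /* flag in the LWP's MMS descriptor      */
};

/* Pick the runnable LWP with the earliest deadline. An LWP blocked on
 * I/O counts as runnable once the interrupt handler has set its flag,
 * so waking it requires no system call or user interrupt. */
struct lwp *uls_pick_next(struct lwp *lwps, size_t n) {
    struct lwp *best = NULL;
    assert(lwps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct lwp *p = &lwps[i];
        if (p->blocked_on_io) {
            if (p->ready_for_io == NULL || !*p->ready_for_io)
                continue;               /* still waiting for data */
            p->blocked_on_io = false;   /* wake at user level */
        }
        if (best == NULL || p->deadline < best->deadline)
            best = p;
    }
    return best;
}
```

With P_D blocked on a full buffer, P_N blocked but flagged ready, and P_E runnable in the background, the function returns P_N, matching the switch at time 40 ms.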

5.2. Performance Evaluation 

In this section we compare the following
alternatives for structuring CM applications:

(1) Split-level scheduled LWPs (SLS-LWPs) using
MMSs for I/O.

(2) Threads using separate system calls for
scheduling and I/O.

(3) LWPs without split-level scheduling (pure
LWPs) using UNIX asynchronous, non-blocking
I/O. In this case, if an LWP does not
find data available when it does a non-blocking
read(), it has to wait for a signal and then do
a select() before calling read() again.

We have implemented prototypes of split-level 
scheduling and memory-mapped streams, and meas- 
ured the CPU times of their basic operations on a 
DECstation 3100 (a 14 MIPS machine representative 
of current RISC workstations). For the other 
approaches, we measured scheduling and I/O syn- 
chronization costs on a DECstation 3100 running 
Mach 2.5. 

Consider a thread or LWP that reads a CM mes- 
sage from an I/O descriptor, processes the message, 
and then changes its deadline to that of the next mes- 
sage in the stream. Table 1 shows the total schedul- 
ing and I/O synchronization overhead per message for 
various scenarios. 

Scheduling and I/O scenario                      Overhead (in μs)

SLS-LWP reads message from MMS
  and then calls time_advance()                          17
SLS-LWP calls IO_wait() (because MMS was empty)
  and is scheduled by INT_IO_READY                       67
SLS-LWP calls timed_sleep()
  and is awakened by INT_TIMER                          132
Thread does a system call to read
  the next message and another
  to change its deadline                                145
Pure LWP does a system call to read
  the next message and another
  to change the process deadline                        129
Pure LWP does a non-blocking read(),
  later receives a signal,
  does a select() and read() to get the message,
  and finally a system call
  to change the process deadline                        384

Table 1: Scheduling and I/O synchronization
overheads per message for different scenarios.

The times shown in Table 1 are a significant
fraction of typical CM message processing costs. For
instance, scaling a 2K block of 8-bit audio samples
takes 1.0 ms, and mixing two 2K blocks takes 1.1 ms.
Thus, the scheduling and I/O synchronization
overhead using threads and pure LWPs ranges from about
15% to 25% of the total message processing time.

Figure 9 shows scheduling and I/O synchronization
overhead as a function of message rate for a
particular workload: an ACME server simultaneously
outputting one video stream and two audio streams and
inputting an audio stream and distributing it on two CM
connections. Thus, the server has five (workahead)
network I/O processes and three (periodic) device I/O
processes.

Threads incur 4 times the overhead of SLS-LWPs
at all message rates; pure LWPs are 6 times as
expensive as SLS-LWPs. Pure LWPs and threads
incur CPU overheads of 33% and 23%, respectively, at
200 messages per second. This message rate is realistic for
some low-delay applications which need an end-to-end
delay on the order of 10 ms (200 messages/sec
represents a packetization delay alone of 5 ms). Moreover,
such a high message rate may also be achieved
instantaneously by moderate-delay applications when
they are working ahead.
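
The arithmetic behind these figures can be checked with a small C sketch. The function names are hypothetical, and the sketch assumes that each process handles messages at the stated rate and that overhead simply adds up across processes; the text does not state either assumption explicitly.

```c
#include <assert.h>

/* Fraction of one CPU spent on scheduling and I/O synchronization when
 * each of n_procs processes handles msgs_per_sec messages per second and
 * each message costs overhead_us microseconds of overhead. */
double cpu_overhead_fraction(double overhead_us, double msgs_per_sec,
                             int n_procs) {
    assert(overhead_us >= 0.0 && msgs_per_sec >= 0.0 && n_procs >= 0);
    return overhead_us * msgs_per_sec * n_procs / 1e6;
}

/* Time to accumulate one message at a given message rate, in ms. */
double packetization_delay_ms(double msgs_per_sec) {
    assert(msgs_per_sec > 0.0);
    return 1000.0 / msgs_per_sec;
}
```

Under these assumptions, packetization_delay_ms(200) gives the 5 ms quoted above, and cpu_overhead_fraction(145, 200, 8) gives about 0.23 for eight processes at the thread cost from Table 1, which appears consistent with the 23% reported for threads.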

6. RELATED WORK 

The work described in this paper is related to 
several directions of current OS research. 
Figure 9: Scheduling and I/O synchronization
overhead for an ACME server as a function of message
rate. The server workload consists of 5 (workahead)
network I/O processes and 3 (non-workahead) device
I/O processes.



• User-level functionality. Modern operating systems
such as Amoeba, Chorus, and Mach shift
functionality from kernel to user level to improve
software structure. In contrast, our work shifts
functionality to user level to increase performance.

• Asynchronous communication. Most existing 
operating systems use request/reply communica- 
tion; examples include UNIX-type system calls, 
RPC, and object invocation. This paradigm is not 
well-suited to continuous media (more generally, it 
may not be well-suited to future distributed sys- 
tems in which speed-of-light delays dominate 
throughput limits). MMSs provide efficient local 
asynchronous communication. Examples of related
work include the asynchronous RPC proposed by
Gifford [8] and the dataflow model of Synthesis
[10].

• Efficient local data transfer. In UNIX-type sys- 
tems, I/O and IPC performance is limited by the 
overhead of data copying. Systems such as Mach, 
DASH and Topaz have attacked this problem 
using techniques such as VM remapping and 
shared memory [11,12,14]. The MMS mechanism
is complementary to this work; it attacks the over- 
head of control rather than data movement. 

SLS is closely related to recent work on
multiprocessor operating system support for parallelism,
including the Psyche multiprocessor system [13] and
scheduler activations [3]. In Psyche, ULSs schedule
LWPs on kernel-supported threads; the kernel
notifies the user VAS of events, such as blocking
cross-domain invocations, that affect the ULS.
User/kernel shared memory is used to efficiently
communicate LWP identifiers and to request timer
user-interrupts. A scheduler activation is a thread-like
execution context in which an LWP can run. As in
Psyche, user-interrupts notify user VASs of
scheduling-related events. Scheduler activations do not use
shared memory to communicate scheduling information
between ULSs and the kernel.

In these two approaches, however, the kernel
and the ULS scheduling policies are independent. As
a result, the approaches cannot correctly prioritize
LWPs across threads that may be running in different
address spaces that are contending for the same
processor. They also cannot exploit policy-specific
information (e.g., LWP priorities) to reduce user/kernel
interactions.

For CM applications, the CPU scheduling
approach of Synthesis [10] represents an alternative
to split-level scheduling. The Synthesis model is
based on rate-control feedback. Processes make
no calls to indicate their temporal progress; instead,
the kernel adjusts time-slice quanta based on queue
lengths. This approach is well-suited to some situations
(e.g., audio DSP with little slack CPU time).

The deadline/workahead scheduling policy is
derived from the earliest-deadline-first policy [9], but
differs in its allowance for workahead.

In Symunix II [7], parallel applications are imple- 
mented as a collection of UNIX processes communi- 
cating through shared memory. Processes use virtual 
preemption masking while holding short-duration 
busy-waiting locks and virtual signal masking while 
updating shared memory. Unlike virtual user-interrupt 
masking, virtual signal masking requires a new sys- 
tem call to handle pending signals after unmasking 
interrupts. 

Like MMSs, DEMOS links [5] can have associ- 
ated shared memory to transfer data between address 
spaces. However, this approach does not reduce 
control overhead; synchronization is necessary after 
each transfer. 

MMSs differ from memory-mapped files in 
several ways. A CM stream may be larger than a 
VAS, and an MMS need not contain the entire CM 
stream; since CM streams are accessed sequentially, 
a small circular buffer suffices. MMSs avoid the over- 
head of page faults. Also, since data is "released" 
explicitly, page replacement algorithms are not 
needed. 



Finally, the URPC mechanism developed by 
Bershad [6] uses shared memory to reduce kernel 
interaction in local client/server IPC on shared- 
memory multiprocessors. This is similar in spirit to 
MMS, though the setting is different. 

7. CONCLUSION 

Existing operating systems incorporate design 
principles that are contrary to the needs of applica- 
tions directly handling real-time streams of continuous 
media data (digital audio and video): 

• The request/reply paradigm (the basis of centralized
systems as well as RPC-based and
object-oriented distributed systems) is non-optimal for
stream-oriented CM data.

• Assumptions about temporal locality and delay
tolerance of data accesses lead to the use of
caching and buffering, which are often inappropriate
for CM data.

• Scheduling policies in current systems have the 
goals of fairness, maximum system throughput, 
and fast interactive response. CM applications 
have real-time requirements that may conflict with 
these goals. 

Starting with the goal of supporting CM applica- 
tions, we have developed two interrelated mechan- 
isms, split-level scheduling and memory-mapped 
streams, for scheduling and IPC. We have described 
their use in a typical CM application (the ACME 
server) and have compared their performance with 
that of the analogous mechanisms in UNIX. They 
improve performance by reducing the number of 
user/kernel interactions.

Split-level scheduling is most effective when
switches between LWPs within a VAS are more
frequent than switches between VASs. This is typical of
CM systems when there is at most one VAS with
low-delay processes (such as ACME's device I/O
processes) that require CPU time at frequent intervals.
To best exploit split-level scheduling, the I/O server
should be the only application run on the workstation.
CM playback and record applications have only
high-delay processes; hence compute servers and data
servers may run multiple applications of this type and
still benefit from split-level scheduling.

These mechanisms are applicable for purposes 
other than CM. For example, memory-mapped 
streams could be used for access to a sequential disk 
file or a network stream connection. Process control 
applications (e.g., [1]) have scheduling requirements 
similar to those of CM. Split-level scheduling could be 
used with a time-slicing policy for a situation where a 
VAS contains both interactive and background 
processes. More generally, the mechanisms may be 
useful in any situation where the rate of I/O and 



scheduling operations, and the cost of user/kernel 
interactions, are high. 